
    Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

    Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech, with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks.
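    As a rough illustration of one single-channel front-end technique of the kind such surveys cover, the sketch below applies a DNN-estimated time-frequency mask to a noisy spectrogram to suppress additive noise. The two-layer topology, the STFT settings, and all sizes are illustrative assumptions, not recommendations from the survey.

```python
# Minimal sketch of a mask-based single-channel enhancement front-end.
# Topology and sizes are illustrative assumptions.
import torch
import torch.nn as nn

N_FFT, HOP = 512, 128
N_BINS = N_FFT // 2 + 1  # frequency bins per STFT frame

class MaskEstimator(nn.Module):
    """Predicts a [0, 1] time-frequency mask from noisy log-magnitudes."""
    def __init__(self, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_BINS, hidden), nn.ReLU(),
            nn.Linear(hidden, N_BINS), nn.Sigmoid(),
        )

    def forward(self, log_mag):              # (frames, N_BINS)
        return self.net(log_mag)

def enhance(noisy, model):
    """STFT -> masking -> inverse STFT; `noisy` is a 1-D waveform tensor."""
    win = torch.hann_window(N_FFT)
    spec = torch.stft(noisy, N_FFT, HOP, window=win, return_complex=True)
    log_mag = torch.log1p(spec.abs()).T      # (frames, N_BINS)
    mask = model(log_mag).T                  # back to (N_BINS, frames)
    return torch.istft(spec * mask, N_FFT, HOP, window=win,
                       length=noisy.shape[-1])
```

    The enhanced waveform would then be passed to the recognizer's feature extraction; in joint training frameworks, the mask estimator's parameters are optimized together with the back-end rather than in isolation.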

    Sub-word based language modeling of morphologically rich languages for LVCSR

    Speech recognition is the task of decoding an acoustic speech signal into written text. Large vocabulary continuous speech recognition (LVCSR) systems deal with a large vocabulary of words, typically more than 100k words, pronounced continuously in a fluent manner. Although most of the techniques used in speech recognition are language independent, different languages still pose different challenges. Efficient language modeling is considered one of the hard challenges facing LVCSR of morphologically rich languages. The complex morphology of such languages causes data sparsity and high out-of-vocabulary rates, leading to poor language model probability estimates. Traditional m-gram language models estimated over full words are usually characterized by high perplexities and are unable to model unseen words, which are likely to occur in open vocabulary speech recognition tasks such as open-domain dictation and broadcast news transcription.

    This thesis addresses the problem of building efficient language models for morphologically rich languages. Alternative language modeling approaches are developed to handle the complex morphology of such languages. This work extensively investigates sub-word based language models using different types of sub-words, such as morphemes and syllables, and shows how to carefully optimize their performance to minimize word error rate. In addition, the pronunciation model is combined with the language model by pairing sub-words with their context-dependent pronunciations, forming a set of joint units called graphones. Moreover, a novel approach is examined using extended hybrid language models comprising multiple types of units in one flat model.

    Although sub-word based language models are successful in handling unseen words, they still lack generalization with regard to unseen word sequences. To overcome this problem, morphology-based classes are incorporated into the modeling process to support probability estimation for sparse m-grams. Examples of such models are the stream-based and class-based language models, as well as the factored language models. A novel methodology is proposed which derives morphology-based classes at the level of morphemes rather than full words, thereby retaining the benefits of both sub-word based language models and morphology-based classes. Moreover, the aforementioned approaches are combined with efficient state-of-the-art language modeling techniques such as the hierarchical Pitman-Yor language model, a Bayesian language model based on the Pitman-Yor process that has been shown to improve both perplexity and word error rate over conventional modified Kneser-Ney smoothed m-gram models. In this thesis, hierarchical Pitman-Yor models are used to estimate class-based language models with sub-word level classes.

    Recently, continuous space language models have shown significant performance improvements in LVCSR tasks. The continuous nature of such models allows for better generalization due to the inherent smoothing capabilities of continuous space. One successful continuous model used in pattern recognition tasks is the feed-forward deep neural network with multiple hidden layers, which can capture higher-level, abstract information about the input features.
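    A minimal sketch of such a feed-forward m-gram language model over sub-word units follows, assuming a fixed three-unit context window; the vocabulary size, embedding width, and layer sizes are illustrative placeholders, not the thesis's settings.

```python
# Sketch of a feed-forward neural language model over sub-word units
# (e.g. morphemes); all sizes are illustrative assumptions.
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    def __init__(self, vocab=50_000, context=3, emb=128, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.net = nn.Sequential(
            nn.Linear(context * emb, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),   # "multiple hidden layers"
            nn.Linear(hidden, vocab),               # logits over the next unit
        )

    def forward(self, ctx_ids):                     # (batch, context)
        e = self.embed(ctx_ids).flatten(1)          # concatenate context embeddings
        return self.net(e).log_softmax(dim=-1)      # log P(next unit | context)
```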
    Feed-forward deep neural networks have recently shown improved performance compared to shallow neural networks in many pattern recognition tasks. In this work, feed-forward deep neural networks are explored for estimating sub-word based language models. In addition, word and sub-word level classes are used as inputs to the neural networks in order to improve probability estimation under morphological richness. The methods applied in this work are tested on Arabic, German, and Polish as representative morphologically rich languages. Experiments are conducted using the state-of-the-art LVCSR systems used by RWTH Aachen in the GALE, Quaero, and BOLT research projects. The methods developed in this thesis reduce the word error rate by up to 7% relative compared to heavily optimized traditional approaches applied to very large vocabularies, typically up to one million words.
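    To illustrate the class-augmented variant described above, here is a hypothetical sketch in which each context position contributes both a sub-word embedding and a morphology-class embedding; the class inventory size and all widths are assumptions rather than the thesis's values.

```python
# Sketch of a class-augmented feed-forward LM: concatenating class
# embeddings lets sparse m-grams share statistics across morphologically
# similar units.  Sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ClassAugmentedLM(nn.Module):
    def __init__(self, vocab=50_000, n_classes=1_000, context=3,
                 emb=128, cls_emb=32, hidden=512):
        super().__init__()
        self.unit_embed = nn.Embedding(vocab, emb)
        self.cls_embed = nn.Embedding(n_classes, cls_emb)
        self.net = nn.Sequential(
            nn.Linear(context * (emb + cls_emb), hidden), nn.Tanh(),
            nn.Linear(hidden, vocab),
        )

    def forward(self, unit_ids, class_ids):         # both (batch, context)
        e = torch.cat([self.unit_embed(unit_ids),
                       self.cls_embed(class_ids)], dim=-1)
        return self.net(e.flatten(1)).log_softmax(dim=-1)
```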

    Investigations on the Use of Morpheme Level Features in Language Models for Arabic LVCSR

    A major challenge for Arabic Large Vocabulary Continuous Speech Recognition (LVCSR) is the rich morphology of Arabic, which leads to high out-of-vocabulary (OOV) rates and poor Language Model (LM) probabilities. In such cases, using morphemes rather than full words is considered a better choice for LMs, achieving higher lexical coverage and lower LM perplexities. On the other hand, an effective way to increase the robustness of LMs is to incorporate features of words into the models. In this paper, we investigate the use of features derived for morphemes rather than words, thus combining the benefits of both morpheme level and feature-rich modeling. We compare the performance of stream-based, class-based and factored LMs (FLMs) estimated over sequences of morphemes and their features for performing Arabic LVCSR. A relative reduction of 3.9% in Word Error Rate (WER) is achieved compared to a word-based system. Index Terms — language model, morpheme, stream-based, class-based, factored
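    As a rough illustration of the class-based idea compared in such papers, the toy sketch below lets sparse morpheme bigrams fall back on class bigram statistics, P(m | m') ≈ P(c(m) | c(m')) · P(m | c(m)). The morpheme stream and the class map are invented toy data, not from the paper.

```python
# Toy class-based bigram over morphemes; corpus and classes are
# hypothetical Arabic-like examples, with no smoothing.
from collections import Counter

corpus = ["wa+", "katab", "+uu", "darras", "+uu"]   # toy morpheme stream
cls = {"wa+": "PRE", "katab": "STEM", "darras": "STEM", "+uu": "SUF"}

cls_bi = Counter(zip((cls[m] for m in corpus),
                     (cls[m] for m in corpus[1:])))  # class bigram counts
cls_uni = Counter(cls[m] for m in corpus[:-1])       # class history counts
memb = Counter(corpus)                               # morpheme counts
cls_tot = Counter(cls[m] for m in corpus)            # class totals

def p(morph, prev):
    """P(morph | prev) ~= P(class(morph) | class(prev)) * P(morph | class(morph))."""
    c, pc = cls[morph], cls[prev]
    return (cls_bi[(pc, c)] / cls_uni[pc]) * (memb[morph] / cls_tot[c])

print(p("darras", "wa+"))   # nonzero although the bigram "wa+ darras" is unseen
```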